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J^j Abstract 

£i 

d Bayesian phylogenetic methods are generating noticeable enthusiasm in 

the field of molecular systematics. Many phylogenetic models are often at 
stake and different approaches are used to compare them within a Bayesian 
framework. The Bayes factor, defined as the ratio of the marginal likelihoods 
of two competing models, plays a key role in Bayesian model selection. We 
focus on an alternative estimator of the marginal likelihood whose computa- 
tion is still a challenging problem. Several computational solutions have been 
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proposed none of which can be considered outperforming the others simulta- 
neously in terms of simplicity of implementation, computational burden and 
precision of the estimates. Practitioners and researchers, often led by available 
software, have privileged so far the simplicity of the harmonic mean estimator 
(HM) and the arithmetic mean estimator (AM). However it is known that the 
resulting estimates of the Bayesian evidence in favor of one model are biased 
and often inaccurate up to having an infinite variance so that the reliability 
of the corresponding conclusions is doubtful. Our new implementation of the 
generalized harmonic mean (GHM) idea recycles MCMC simulations from the 
posterior, shares the computational simplicity of the original HM estimator, 
but, unlike it, overcomes the infinite variance issue. The alternative estima- 
tor is applied to simulated phylogenetic data and produces fully satisfactory 
results outperforming those simple estimators currently provided by most of 
the publicly available software. 

keywords: Bayes factor, harmonic mean, importance sampling, marginal likeli- 
hood, phylogenetic models. 



1 Introduction 



The theory of evolution states that all organisms are related through a history of 
common ancestor and that life on Earth diversified in a tree-like pattern connecting 
all living species. Phylogenetics aims at inferring the tree that better represents 
the evolutionary relationships among species studying differences and similarities in 
their genomic sequences. Alternative tree estimation methods such as parsimony 



methods (Felsenstein (2004), chapter 7) and distance methods (Fitch and Margo- 



liash, 1967 Cavalli-Sforza and Edwards, 1967) have been proposed. In this paper 



we will focus on stochastic models for substitution rates and we address the model 
choice issue within a fully Bayesian framework proposing an alternative model ev- 
idence estimation procedure. The paper is organized as follows: in Section [2] we 
briefly review basic phylogenetic concepts and the Bayesian inference for substi- 
tution models. In Section |3] we focus on the model selection issue for substitution 
models and discuss some available computational tools for Bayesian model evidence. 
One of the most popular tool for computing model evidence in phylogenetics is the 
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Harmonic Mean (HM) estimator proposed by Newton and Raftery ( 1994 ) as an easy- 
to-apply instance of a more general class of estimators called Generalized Harmonic 
Mean (GHM). An alternative version of GHM is considered in Section [4j It has 
been introduced in Petris and Tardella (2007) under the name of Inflated Density 
Ratio (IDR) and its implementation for substitution models is described in Section 
[5] Numerical examples and comparative performance are given in Section [6} We 
conclude with a brief discussion in Section [71 



2 Substitution models: a brief overview 

Phylogenetic data consists of homologous DNA strands or protein sequences of re- 
lated species. Observed data consists of a nucleotide matrix X with n rows rep- 
resenting species and k columns representing sites. Comparing DNA sequences of 
two related species, we define substitution the replacement in the same situs of one 
nucleotide in one species by another one in the other species. The stochastic mod- 
els describing this replacement process are called substitution models. A phylogeny 
or a phylogenetic tree is a representation of the genealogical relationships among 
species, also called taxa or taxonomies. Tips (leaves or external nodes) represent 
the present-day species, while internal nodes usually represent extinct ancestors for 
which genomic sequences are no longer available. The ancestor of all sequences is the 
root of the tree. The branching pattern of a tree is called topology, and is denoted 
with r, while the lengths v T of the branches of the tree r represent the time periods 
elapsed until a new substitution occurs. 

DNA substitution models are probabilistic models which aim at modeling changes 
between nucleotides in homologous DNA strands. Changes at each site occur at 
random times. Nucleotides at different sites are usually assumed to evolve indepen- 
dently each other. For a fixed site, nucleotide replacements over time are modeled by 
a 4-state Markov process, in which each state represents a nucleotide. The Markov 
process indexed with time t is completely specified by a substitution rate matrix 
Q(t) = Tij(t): each element r^it), i ^ j, represents the instantaneous rate of substi- 
tution from nucleotide i to nucleotide j. The diagonal elements of the rate matrix are 
defined as rait) = J2j f <=i r ij(t) so that X^=i r y(^) = 0> The transition probability 
matrix P{t) = {pij(t)}, defines the probability of changing from state i to state j. 
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The substitution process is assumed homogeneous over time so that Q(t) = Q and 
P(t) = P. It is also commonly assumed that the substitution process at each site 
is stationary with equilibrium distribution II = (it a, Kc, ^g, ttt) an d time-reversible, 
that is 



(2.1) 



where 7Tj is the proportion of time the Markov chain spends in state % and TTj/Vj is 



the amount of flow from state i to j. Equation (2.1) is known as detailed-balance 



condition and means that flow between any two states in the opposite direction is the 

Pij "i > 



same. Following the notation in Hudelot et al. (2003), we define Tij(t) 



r 



Vz 7^ j, where p^ is the transition rate from nucleotide i to nucleotide j. This 
reparameterization is particularly useful for the specification of substitution models, 
since it makes clear the distinction between the nucleotide frequencies ha, t^g, ttc, 
and substitution rates py, allowing to spell out different assumptions on evolutionary 
patterns. The most general time-reversible nucleotide substitution model is the so- 
called GTR defined by the following rate matrix 



Q 



( - Pac^c Pag^g Pat^t \ 

PAC^A - PCG^G PCT^T 

Pag^a Pcg^c - Pgt^t 

\ Pat^a Pct^c Pgtttg - ) 



(2.2) 



and more thoroughly illustrated in Lanave et al. (1984). Several substitution models 



can be obtained simplifying the Q matrix reflecting specific biological assumptions: 



the simplest one is the JC69 model, originally proposed in Jukes and Cantor (1969) 



which assumes that all nucleotides are interchangeable and have the same rate of 
change, that is p^ = p Vi, j and tta = = ^g = ^t- 

In this paper, for illustrative purposes we will consider only instances of GTR and 



JC69 models. One can look at Felsenstein (2004) and Yang (2006) for a wider range 



of alternative substitution models. 

2.1 Bayesian inference for substitution models 

The parameter space of a phylogenetic model can be represented as 



Q = {t, u t , 9} 
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where r G T is the tree topology, v T the set of branch lengths of topology r, and 
6 = (p, 7r) the parameters of the rate matrix. We denote the cardinality of T . 
Notice that Nt is a huge number even for few species. For instance with n = 10 
species there are about Nt ~ 2 • 10 6 different trees. 

Observed data consists of a nucleotide matrix X once specified the substitution 
model M, the likelihood p(X\r, u T , 8, M) can be computed using the pruning algo- 



rithm, a practical and efficient recursive algorithm proposed in Felsenstein (1981). 
One can then make inference on the unknown model parameters looking for the val- 
ues which maximize the likelihood. Alternatively one can adopt a Bayesian approach 
where the parameter space is endowed with a joint prior distribution tt(t, u t , 9) on 
the unknowkn parameters and the likelihood is used to update the prior uncertainty 
about (r, u T ,0) following the Bayes' rule: 

p(X\T,v T ,e,M)n{T,u T ,9) 



V^r\X,M) 



where 



m{X\M) = X) / / P( X l r ' v t> 9 > M ) 7r ( r > v t, 0)dv T d6 
reT T 

The resulting distribution p(r, u T ,0\X, M) is the posterior distribution which coher- 
ently combines prior believes and data information. Prior believes are usually con- 
veyed as it(t,u t ,0) = -k(t)'k(i' t )'k{0). The denominator of the Bayes' rule m(X\M) 
is the marginal likelihood of model M and it plays a key role in discriminating 
among alternative models. 

More precisely, suppose, we aim at comparing two competing substitution models 
M and Mi. The Bayes Factor is defined as the ratio of the marginal likelihoods as 
follows 

m{X\Mi) 



m(X\M ) 
where, for i — 0, 1 



c w = m{X\Mi) = V / / p{X\0,T,Mi)ir{0^)de^ir{v®\T)dv®Tt{T) (2.3) 



Numerical guidelines for interpreting the evidence scale are given in Kass and Raftery 



(1995). Values of BF W > 1 (log(BF w ) > 0) can be considered as evidence in favor 
of Mi but only a value of BF W > 100 (log(BF 10 > 4.6) can be really considered as 
decisive. 
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Most of the times the posterior distributions p(r, u T , 6\X, M;) and marginal like- 
lihoods are not analytically computable but can be approached through appropri- 
ate approximations. Indeed, over the last ten years, powerful numerical methods 
based on Markov Chain Monte Carlo (MCMC) have been developed, allowing one 
to carry out Bayesian inference under a large category of probabilistic models, even 
when dimension of the parameter space is very large. Indeed, several ad-hoc MCMC 



algorithms have been tailored for phylogenetic models (Larget and Simon, 1999 Li 



et at, 2000) and are currently implemented in publicly available software such as 



in MrBayes ( |Ronquist and Huelsenbeck| |2003[ ) and PHASE ( |Gowri-Shankar and 
Jow|[2006|. 



3 Model selection for substitution models 



Given the variety of possible stochastic substitution mechanisms, an important is- 
sue of any model-based phylogenetic analysis is to select the model which is most 
supported by the data. Several model selection procedures have been proposed de- 
pending also on the inferential approach. A classical approach to model selection for 
choosing between alternative nested models is to perform the hierarchical likelihood 
ratio test (LRT) ( |Posada and Crandall , 2001). A number of popular programs allow 



users to compare pairs of models using this test such as PAUP (Swafford, 2003) 



PAML (Yang 2007) and the R package APE (R Development Core Team 2008). 



However, Posada and Buckely (2004) have shown some drawbacks of performing 



systematic LRT for model selection in phylogenetics. This is because the model 
that is finally selected can depend on the order in which the pairwise comparisons 



are performed (Pol, 2004). Moreover, it is well-known that LRT tends to favor pa- 



rameter rich models. 

The Akaike Information Criterion (AIC) is another model-selection criterion com- 



monly used also in phylogenetics (Posada and Buckely, 2004): one of the advantages 



of the AIC is that it allows to compare nested as well as non nested models and 
it can be easily implemented. However, also the AIC tends to favor parameter- 
rich models. To overcome this selection bias one can use the Bayesian Information 



Criterion (BIC) (Schwartz, 1978) which better penalizes parameter-rich models. 
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Sometimes these criteria applied to the same data can end up selecting very 



different substitution models, as shown in Abdo et al. (2005). Indeed they compare 
ratios of likelihood values penalized for an increase in the dimension of one of the 
models, without directly accounting for uncertainty in the estimates of model pa- 
rameters. The latter aspect is addressed within a fully Bayesian framework through 
the use of the Bayes Factor. Bayes Factor directly incorporates this uncertainty and 
its meaning is more intuitive than other methods since it can be directly used to 
assess the comparative evidence provided by the data in terms of the most probable 
model under equal prior model probabilities. 



Bayes Factor for comparing phylogenetic models was first introduced in Sinsheimer 



et al. (1996) and Suchard et al. (2001). Since then its popularity in phylogenetics 



has grown so that some publicly available software provide in their standard output 
approximations of marginal likelihoods for model evidence and Bayes Factor evalu- 
ation. Indeed the complexity of phylogenetic models and the computational burden 
in the light of high-dimensional parameter space make the problem of finding alter- 
native and more efficient computational strategy for computing Bayes Factor still 



open and in continuous development (Lartillot et al, 2007; Ronquist and Deans 
20101. 



3.1 Available computational tools for Bayesian model evi- 
dence 

The computation of the marginal likelihood m(X\M) of a phylogenetic model M 
is not straightforward. It involves integrating the likelihood over fc-dimensional 
subspaces for the branch length parameters v T and the substitution rate matrix 
9 = (p, 7r) and eventually summing over all possible topologies. 
Most of the marginal likelihood estimation methods proposed in the literature have 



been applied extensively also in molecular phylogenetics (Minin et al, 2003 Lar 



tillot et al., 2007 Suchard et al, 2001). Among these methods, many of them are 



valid only under very specific conditions. For instance, the Dickey-Savage ratio 



(Verdinelli and Wasserman, 1995) applied in phylogenetics in Suchard et al. (2001) 



assumes nested models. Laplace approximation (Kass and Raftery, 1995) and BIC 
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(Schwartz, 1978), applied in phylogenetics firstly in Minin et al. (2003), require large 
sample approximations around the maximum likelihood, which can be sometimes 
difficult to compute or approximate for very complex models. A recent appealing 



variation of the Laplace approximation has been proposed in Rodrigue et al. (2007): 
however, its applicability and performance are endangered when the posterior dis- 
tribution deviates from normality and the maximization of the likelihood can be 
neither straightforward nor accurate. 



The reversible jump approach (Green, 1995 Bartolucci et al, 2006) is another 



MCMC option applied to phylogenetic model selection in Huelsenbeck et al. (2004). 
Unfortunately the implementation of this algorithm is not straightforward for the 
end user and it often requires appropriate delicate tuning of the Metropolis Hast- 
ings proposal. Moreover its implementation suffers extra difficulties when comparing 



models based on an entirely different parametric rationale (Lartillot et al. 2007). 



As recently pointed out in Ronquist and Deans (2010) among the most widely 
used methods for estimating the marginal likelihood of phylogenetic models are the 
thermodynamic integration, also known as path sampling, and the harmonic mean 



approach. The thermodynamic integration reviewed in Gelman and Meng (1998) 



and first applied in a phylogenetic context in Lartillot and Philippe (2006) produces 
reliable estimates of Bayes Factors of phylogenetic models in a large varieties of 
models. Although this method has the advantage of general applicability, it can 
incur high computational costs and may require specific adjustments. For certain 
model comparisons, a full thermodynamic integration may take weeks on a modern 
desktop computer, even under a fixed tree topology for small single protein data sets 



(Rodrigue et al. , 2007 Ronquist and Deans, 2010). On the other hand, the HM esti- 
mator can be easily computed and it does not demand further computational efforts 
other than those already made to draw inference on model parameters, since it only 
needs simulations from the posterior distributions. However, it is well known that 
the HM estimator is unstable since it can end up with an infinite variance. As high- 



lighted by Ronquist and Deans (2010), thermodynamic integration and reversible 
jump are, until now, the most accurate tools for computing the marginal likelihood. 
However, until these methods become more user-friendly and more widely available, 
simple tools for exploring in a quicker way the more interesting models are useful. 
For this reason in the next sections we focus on an alternative generalized harmonic 
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mean estimator, the IDR estimator, which shares the computational simplicity of 
HM estimator but, unlike it, better copes with the infinite variance issue. Its simple 
implementation makes the IDR estimator a useful and more reliable method for eas- 
ily comparing competing substitution models. It can be used also as a confirmatory 
tool even in those models for which more complex estimation methods, such as the 
path sampling, can be applied. 



3.2 Harmonic Mean estimators 



We introduce the basic ideas and formulas for the class of estimators known as 
Generalized Harmonic Mean (GHM). Since the marginal likelihood is nothing but 
the normalizing constant of the unnormalized posterior density, we illustrate the 
GHM estimator as a general solution for estimating the normalizing constant of a 
non-negative, integrable density g defined as 



g(0)de 



(3.1) 



where 9 G Q C 5i fc and g(9) is the unnormalized version of the probability distribu- 
tion g{6). The GHM estimator of c is based on the following identity 

' (3-2) 



En 



(mX 



where / is a convenient instrumental Lebesgue integrable function which is only 
required to have a support which is contained in that of g and to satisfy 



f(9)d9 



(3.3) 



The GHM estimator, denoted as cqhm is the empirical counterpart of (3.2), namely 

1 



CGHM 



1 V T ' 

T 



(3.4) 



where 9i, 92, ■■■,9t are sampled from g. In Bayesian inference the very first instance 



of such GHM estimator was introduced in Gelfand and Dey (1994) to estimate 



the marginal likelihood considered as the normalizing constant of the unnormalized 
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posterior density g{9) = ir(8)L(9). Hence, taking f{9) = ir{9) one obtains as special 



case of (3.4) the Harmonic Mean estimator 



CRM 



(3.5) 



L(9t) 



which can be easily computed by recycling simulations 6i,...,9t from the target 
posterior distribution g(9) available from MC or MCMC sampling scheme. This 
probably explains the original enthusiasm in favor of chm which indeed was con- 
sidered a potential convenient competitor of the standard Monte Carlo Importance 
Sampling estimate given by the (Prior) Arithmetic Mean (AM) estimator 



cam 



htm 

t=i 



(3.6) 



where 9±, ...,9t are sampled from the prior tt. 



The implementation of and ,more generally, (3.4) requires a relatively light com 



putational burden hence reducing computing time with respect to thermodynamic 
integration. The simplicity of the computation has then favored the widespread 
use of the Harmonic Mean estimator with respect to more complex methods. In 
fact, the Harmonic Mean estimator is implemented in several Bayesian phylogenetic 
software as shown in Table [l] and recent biological papers (Yamanoue et al, 2008 



Wang et al. , 2009 Normana et al. , 2009 ) report the HM as a routinely used model 



selection tool. 



Table 1 about here 



However, both cam and Chm can end up with a very large variance and unstable 
behavior. This fact cannot be considered as an unusual exception but it often occurs 
and the reason for that can be argued on a theoretical ground. 

For cam the erratic behavior is simply explained by the fact that the likelihood 
usually gives support to a region with low prior weight hence sampling from the 
prior yields low chance to hit high likelihood region and large chance to hit much 
lower likelihood region ending up in a large variance of the estimate cam- Indeed, 



starting from the original paper Newton and Raftery (1994), (see in particular R. 
Neal's discussion) it has been shown that even in very simple and standard gaussian 
models also the HM estimator can end up having an infinite variance hence yielding 
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unreliable approximations. This fact raises sometimes the question whether they 
are reliable tools and certainly has encouraged researchers to look for alternative 
solutions. Several generalizations and improved alternatives have been proposed 



and recently reviewed in Raftery et al. (2007) 



In the following sections we will consider a new marginal likelihood estimator, 



the Inflated Density Ratio (IDR) estimator, proposed in Petris and Tardella (2007) 



which is a particular instance of the Generalized Harmonic Mean (GHM) approach. 
This new estimator basically shares the original simplicity and the computation 
feasibility of the HM estimator but, unlike it, it can guarantee important theoretical 
properties, such as a bounded variance. 



4 IDR: Inflated Density Ratio estimator 

The inflated density ratio estimator is a different formulation of the GHM estimator, 
based on a particular choice of the instrumental density f{9) as originally proposed 
in Petris and Tardella (2007). The instrumental f{9) is obtained through a pertur- 



bation of the original target function g. The perturbed density, denoted with g Pk , is 
defined so that its total mass has some known functional relation to the total mass 



c of the target density g as in (3.1). In particular, gp k is obtained as a parametric 
inflation of g so that 



g Pk ^)= C +k 



(4.1) 



where k is a known inflation constant which can be arbitrarily fixed. The perturba- 



tion device comes from an original idea in Petris and Tardella (2003|) and is deteiled 
in 



Petris and Tardella (2007) for unidimensional and multidimensional densities. In 



the unidimensional case the perturbed density is 

g{6 + r k ) iie < -r k 
9p k (V) = { 9(0) if-r k <6<r k 
g(0 - r k ) H9>r k 



(4.2) 
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with 2rfc = corresponding to the length of the interval centered around the 
origin where the density is kept constant. In Figure [T] one can visualize how the 
perturbation acts. The perturbed density allows one to define an instrumental 
density fk{@) = 9Pk ^\ 9 ^ which satisfies the requirement (3.3) needed to define the 



GHM estimator as in (3.4). The Inflated Density Ratio estimator cjdr for c is then 
obtained as follows 



k 



ClDR 



1 9P k {dt) 



E 



g(6 t ) 



(4.3) 



where 9\, ...,9t is a sample from the normalized target density g. The use of the 
perturbed density as importance function leads to some advantages with respect to 
the other instances of cqhm proposed in the literature. In fact cjdr defined as in Q 
yields a finite-variance estimator under mild sufficient conditions and a wide range 



of g densities Petris and Tardella (2007) (Lemma 1, 2 and 3). Notice that in order 
for the perturbed density gp k to be defined it is required that the original density g 
has full support in 3f? fe . Moreover, the use of a parametric perturbation makes the 
method more flexible and efficient with a moderate extra computational effort. 



Like all methods based on importance sampling strategies, the properties of the 
estimator cjdr strongly depend on the ratio 9Pk rJP ■ To evaluate its performance one 
can use an asymptotic approximation (via standard delta-method) of the Relative 
Mean Square Error of the estimator 



RMSE, 



CIDR 



\ 



E~ n 



ClDR - C 
C 



Var 



9pM 
[ 9(0) J 



RMSEg ID Delta (4.4) 



which can be ultimately estimated as follows: 
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Figure 1: Left Panel: original density g with total mass c. Right Panel: perturbed 



density gp k defined as in (4-2). The total mass of the perturbed density is then c + k. 
The shaded area correspond to the inflated mass k with k — 2 • • g(0) as in (4-2). 



RMSE 



ClDR 



C IDR, Delta 



Var g 



9pM 
. 9(0) 



(4.5) 



where Var g is the observed sample variance of the ratio gp k (8)/g(6). 



The expression in Equation (|4.5|) clarifies the key role of the choice of k with respect 

tends to 0, 

gpuW 



to the error of the estimator: for k — > 0, the variance of the ratio 9Pk , ( f) 

' 9(0) 

since gp k is very close to g, but | tends to infinity: in other words, if Var g 
would favor as little values of k as possible, t acts in the opposite direction. In 



order to address the choice of fc, Petris and Tardella (2007) suggested to choose the 



perturbation k which minimizes the estimation error (4.5). In practice, one can cal- 



culate the values of the estimator for a grid of different perturbation values, ciDFi(k), 
k = 1,...,K and choose the optimal k opt as the k for which RMSEg IDR (u\ is mini- 
mum. This procedure for the calibration of k requires iterative evaluation of ciDn{k) 
hence is relatively heavier than the HM estimator, but it does not require extra sim- 
ulations which in the phylogenetic context is often the main time-consuming part. 
Hence, the computational cost is alleviated by the fact that one uses the same sam- 
ple from g and the only quantity to be evaluated K times is the inflated density gp k . 
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Once obtained the ratio of the perturbed and the original density, the computation 
of cjDfi(k) is straightforward. 

For practical purposes, the computation of the inflated density when the support 
of g is the whole 9ft fc can be easily implemented in R using a function available at 



http : // sites . google . com/ site/ idrharmonicmean/home. In Petris and Tardella 



(2007) and Arima (2009) it has been shown that in order to improve the precision of 



cidr it is recommended to standardize the simulated data 9±, 8t with respect to 
a (local) mode and the sample variance-covariance matrix so that the correspond- 
ing density has a local mode at the origin and approximately standard variance- 
covariance matrix. This is automatically implemented in the publicly available R 
code. 

In order to assess its effectiveness, the IDR method has been applied to simu- 
lated data from differently shaped distributions for which the normalizing constant 



is known. As shown in Petris and Tardella (2007), the estimator produces fully 



convincing results with simulated data from several known distributions, even for a 
100-dimensional multivariate Gaussian distribution. In terms of estimator precision, 



these results are comparable with those in Lartillot and Philippe (2006) obtained 



with the thermodynamic integration. In Arima (2009) simple antithetic variate 



tricks allow the IDR estimator to perform well even for those distributions with 
severe variations from the symmetric Gaussian case such as asymmetric and even 
some multimodal distributions. Table [2] shows the estimates obtained by applying 
the IDR method in several controlled scenarios: the method correctly reproduces 
the true value of the normalizing constant for different shape and dimension of the 
target function. Some real data implementation with standard generalized linear 



models have been also reported in Petris and Tardella (2007). In the next Section 



we extend the IDR method in order to use cidr in more complex settings such as 
phylogenetic models. 



Table 2 about here 
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5 Implementing IDR for substitution models with 
fixed topology 

We extend the Inflated Density Ratio approach in order to compute the marginal 
likelihood of phylogenetic models. In this section we show how to compute the 
marginal likelihood when it involves integration of substitution model parameters 9 
and the branch lengths v T which are both defined in continuous subspaces. Indeed 
the approach can be used as a building block to integrate also over the tree topology 
r. 

For a fixed topology r and a sequence alignment X, the parameters of a phylogenetic 
model M T are denoted as u = (9, v T ) C Q T . The joint posterior distribution on u is 
given by 

p{V,u T \X,M T )- m ( X \M T ) [ j 

where 



m(X\M T )= / / p(X\e,v T ,M T )ir(9)ir(u T )d9du T (5.2) 
Je Jv T 

is the marginal likelihood we aim at estimating. 

When the topology r is fixed, the parameter space Q T is continuous. Hence, in order 
to apply the IDR method we only need the following two ingredients: 

• a sample (9^\ z/f ), (9^ T \ v^) from the posterior distribution, p(9, u T \X, M T ) 

• the likelihood and the prior distribution evaluated at each posterior sampled 
value (#W,z4 fc) ), that is p(X\9^ k \ 4 k \ M T ) and n(9^ k \4 k) ) = n(9^)7r(u { T k) ) 

The first ingredient is just the usual output of the Monte Carlo Markov Chain sim- 
ulations derived from model M and data X. The computation of the likelihood and 
the joint prior is usually already coded within available software. The first one is 
accomplished through the pruning algorithm while computing the prior is straight- 
forward. Indeed a necessary condition for the inflation idea to be implemented as 



prescribed in Petris and Tardella (2007) is that the posterior density must have full 
support on the whole real /c-dimensional space. In our phylogenetic models this is 
not always the case hence we explain simple and fully automatic remedies to over- 
come this kind of obstacle. 

We start with branch length parameters which are constrained to lie in the positive 
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half-line. In that case the remedy is straightforward. One can reparameterize with 
a simple logarithmic transformation 



v T = log(y T ) (5.3) 

so that the support corresponding to the reparameterized density becomes uncon- 
strained. Obviously the log[y T ) reparameterization calls for the appropriate Jaco- 
bian when evaluating the corresponding transformed density. For model parameters 
with linear constraints like the substitution 9 = {p,ir}, a little less obvious trans- 
formation is needed. In this case 9 = {p, tt} are subject to the following set of 
constraints: 

E ** = 1 

i£{A,T,C,G} 

Pij*j = V ie{A,T,C,G} 

je{A,T,C,G} 

Similarly to the first simplex constraint the last set of constraints together with the 



reversibility can be rephrased (Gowri-Shankar, 2006) in terms of another simplex 



constraint concerning only the extra-diagonal entries of the substitution rate matrix 



(2.2) namely 

Pac + Pag + Pat + Pcg + Per + Pgt = 1- 
In order to bypass the constrained nature of the parameter space we have relied on 



the so-called additive logistic transformation (Tiao and Cuttman, 1965 Aitchinson 



1986) which is a one-to-one transformation from M. D ~ l to the (D — 1) -dimensional 
simplex 

S D = {(x\, xd) : x\ > 0, ...,xd > 0; x% + ...xd = 1} • 

Hence we can use its inverse, called additive log-ratio transformation, which is defined 
as follows 

yi = log( — ) i = l,...,D -1 

for any x = (xi, ...,Xd) £ S D . Here the Xj's are the p's and D = 6. Applying these 
transformations to nucleotide frequencies 7Tj and to exchangeability parameters p's, 
the transformed parameters assume values in the entire real support and the IDR 
estimator can be applied. Again the reparameterization calls for the appropriate 
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change-of-measure Jacobian when evaluating the corresponding transformed den- 
sity (see Aitchinson (1986) for details). 



6 Numerical examples and comparative perfor- 
mance 

In this section the successful implementation of the IDR estimator is illustrated with 
data simulated from some typical phylogenetic models. 

Here IDR method has been applied using the MCMC output of the simulations 
from the posterior distribution obtained using the MrBayes software; the likelihood 
has been computed using the R package PML while the reparameterization on 
and IDR perturbation gp k (0) have called for specifically developed R functions. The 
whole R code is available upon request from the first author. 

Two of the simplest and most favorite model evidence output in the publicly 
available software are used as benchmarks: the Harmonic Mean estimator and the 
Arithmetic Mean estimator. Indeed, while the former is guaranteed to be a consis- 
tent estimate of the marginal likelihood, though possibly with infinite variance, the 



latter one is consistent only when formula (3.6) is applied when 9i, 9t are sampled 



from the prior. Since it is known such a prior AM turns out to be very unstable and 
unreliable it has often been replaced by a posterior AM where 9\, ...,6t are sampled 
from the posterior rather than from the prior. In that case one must be aware that 
the resulting quantity can be interpreted only as a surrogate evidence in favor of one 
model and it should by no means be confused with the rigorous concept of marginal 
likelihood and related to Bayes Factor. We now show the performance of IDR in two 
phylogenetic examples. Simulated data is used to have a better control of what one 
should expect from marginal likelihood and the corresponding comparative evidence 
of alternative models. 

6.1 Hadamard 1: marginal likelihood computation 

We use as a first benchmark the synthetic data set Hadamard 1 already employed 



in Felsenstein (2004). It consists of a sequence 1000 of amino acid alignments of six 
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species, A, B, C, D, E and F simulated from a GTR + T model. The true tree is 
shown in the left Panel of Figure [2] 



Figure 2: True phylogenetic trees of the two synthetic data used as benchmarks: 
Hadamard 1 (left Panel) Hadamard 2 (right Panel). 



GTR + T model is implemented in MrBayes which uses the Metropolis Coupling 



algorithm (Altekar et al, 2004) and provides as output the simulated Markov Chain 



and some evaluations of Bayesian model evidence in terms of posterior AM and 
marginal likelihood via HM. 

The simulated Markov chain can be used as a sample from the posterior dis- 
tribution in order to make inference on the model parameters. For this model the 
whole parameter space consists of 18 parameters. In order to reduce the autocor- 
relation and improve the convergence of the 1100000 sampled values, 100000 have 
been discarded with thinning rate equal to 10. We have recycled this MCMC out- 
put to estimate the marginal likelihood of the GTR + T model for the known true 
topology through IDR method. In Table [3] we list the corresponding values of the 
IDR estimator on the log scale (log cidr), the Relative Mean Square Error estimate 
(RMSEc IDR ) as in Equation (4.5) and the confidence interval CI for different per- 



turbation masses k. In order to take into account the autocorrelation of the posterior 
simulated values, a correction has been applied to RMSEc IDR replacing n in (4.5) 
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with the effective sample size given by 

nEss = n x * 7 - (6.1) 

1 + 2 E i= i Pi 

Since the optimal corrected error RMSE* t IDR corresponds to a perturbation 
value k opt = 10~ 7 , the IDR estimator (on a logarithmic scale) for the GTR + T 
model is \ogc IDR = -7258.200. 

Table 3 about here 

We compare the results of the IDR method with those obtained with the Har- 
monic Mean (HM) and the posterior Arithmetic Mean (AM). 

For &hm and cam, relative errors have been estimated respectively as 



RMSE HM = c HM \l Var[ — ) (6.2) 



RMSE^ = ^-KlY^m ( 6 .3) 
cam V n 

Similarly to RMSE* £lDR we have denoted with RMSE*g HM and RMSE* g AM 
the estimates of the relative errors adjusted with the effective sample size uess- 

The three methods produce somewhat different quantities although sometimes 
compatible once accounted for the estimation error. For each method, the Monte 
Carlo relative error of the estimate has been computed re-estimating the model 10 
times (RMSEc,mc) and recording the corresponding different values of c. 
Although it is known that under critical conditions such MC error is not sufficient 
to guarantee its precision it still remains a necessary premise for an accurate esti- 
mate. We have also looked at another precision measurement RMSE^ Boot based on 
bootstrap replications. RMSE^Boot is defined as 



RMSEc,Boot 



\ 



l! ' 'c x 2 

d — V 7 " 1 



1 - 

7,11 



6=1 



where q, denotes the bootstrap replicate of the generic marginal likelihood estimate 
with one of the three alternative formulae cam, chm or cidr and B = 1000. In 
order to account for the autocorrelation, also RMSE^ Boot has been corrected using 



the effective sample size in (6.1). 



Table H] shows the obtained results. 
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Table 4 about here 



Indeed the estimates of the model evidence as well as the corresponding esti- 
mates of their relative errors are very different. In fact, the smallest relative error is 
obtained with the arithmetic mean of the likelihood values simulated from the pos- 
terior. We have already pointed out the fact that this posterior AM does not really 
aim at estimating the marginal likelihood, but we have nonetheless considered it in 
our examples to verify how distant the corresponding values are and how different 
the conclusions and their strengths can be when comparing alternative models via 
the posterior AM estimator. For the Harmonic Mean method the estimated MC 
relative error is formidably high and unstable resulting in a serious warning on the 
reliability of the HM estimator in this context. On the other hand, the Inflated Den- 
sity Ratio approach seems to be a good compromise in terms of order of magnitude 
of the error of the estimate cidr and robustness of its relative error estimation rang- 
ing from 0.15 with independent Monte Carlo re-estimation to 0.30 of the RMSE* d - 



6.2 Hadamard 2: Bayes Factor computation 



We have also considered the Hadamard 2 data in (Felsenstein 2004) as a second 
benchmark synthetic data set. It consists of 200 amino acids and four species, A, B, 
C, D. This dataset have been simulated from the Jukes-Cantor model (JC69) and 
the true tree is shown in the right Panel of Figure [2] 

For the true topology, we compute the marginal likelihood for the JC69 model and 
for the GTR+r model: parameters lie in a 5-dimensional space for the JC69 model 
and in 14-dimensional space for the GTR+r model. The simulated values from the 
Metropolis-Coupled algorithm have been rearranged to evaluate the model evidence 
of both models using IDR, HM and the AM approach. As for the Hadamard 1 data, 
Monte Carlo RMSE have been also computed by repeating the estimates 10 times. 
Table [5] shows the results obtained respectively for the GTR+r and the JC69 
models. 

Table 5 about here 

All methods produce somewhat different results in terms of model evidence; as 
for the previous example, the smallest relative Monte Carlo error is associated once 
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again with the Arithmetic Mean method. Also in this case, estimates of the relative 
errors of the Harmonic Mean method are always larger than those produced by the 
Inflated Density Ratio method. The corresponding Bayes Factors (on logarithmic 
scale) for JC69 and GTR+r are shown in Table [6] Considering the reference values 



for the Bayes Factor defined in Kass and Raftery (1995), all methods consistently 



give support to the Jukes-Cantor model, which is known to be the true model. 
However, we highlight that the strongest evidence in favor of the correct model 
corresponds to the Bayes Factor as estimated by the Inflated Density Ratio . 

Table 6 about here 
6.3 Hadamard 2: tree selection 

We now show how it is possible to extend the IDR approach for dealing with selecting 
competing trees when the topology is not fixed in advance. For a fixed substitution 
model, competing trees can be compared by considering the evidence of the data 
for a fixed tree topology. The evidence in support of each tree topology Tj G N T can 
be evaluated in terms of its posterior probability p(n\X) derived from the Bayes 
theorem as 

Y JT eN T p\ x \ T )' K y T ) 

where the experimental evidence in favor of the model M Ti with fixed tree topology 
Tj is contained in the marginal likelihood 



m{X\M Ti ) =p{X\n) = / ■p(X\u i ,M n )>K(u i \T?)di J j i 

Jn Ti 

where the continuous parameters G £l n of the evolutionary process corresponding 
to M Ti are integrated out as nuisance parameters. Indeed when prior beliefs on trees 
are set equal 7t(t,) = irfa) comparative evidence discriminating Tj against Tj, is 
summarized in the Bayes Factor 

m(X\M T .) , . 

= Mil- (6 - 4) 

We have considered the problem of selecting competing trees of a substitution 
model for Hadamard 2 data. In the previous Subsection, we have verified the feasi- 
bility of the Inflated Density Ratio approach in comparing JC69 with the GTR+r 
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model. Under a fixed topology JC69 was favored. Indeed we know that Hadamard 
2 data was simulated from JC69 model with true topology labeled as r = 1. Now 
we aim at comparing Nt = 3 alternative topologies within the JC69 model and we 
compute and compare the corresponding marginal likelihoods. Results are shown in 
Table 

Table 7 about here 

Also in this case, the IDR exhibits the most convincing performance in terms of 
evidence in support of the true tree as well as precision and robustness of estimates. 



7 Discussion 



In this paper, we investigate the possibility of using simple effective recipes for 
evaluating model evidence of competing models of complex phylogenetic models. In 
a Bayesian framework, several methods have been proposed in order to approximate 
the marginal likelihood of a single model and then eventually estimate the Bayes 
Factor of two competing models. 

Probably, the most widely used methods to date are the thermodynamic integra- 
tion and the harmonic mean approach. The thermodynamic integration has been 
proved to produce more reliable estimates of Bayes Factors of competing phyloge- 
netic models in a large varieties of contexts. Although this method has the advantage 
of general applicability, it is computationally demanding and may require fine tun- 
ings and adjustments. Indeed, the simplicity of implementation combined with a 
relatively light computational burden are two appealing features which explain why 
the HM is still currently one of the most favorite option for routine implementa- 
tion (see von Reumont et al. (2009)). However, the simplicity of HM is often not 



matched with its accuracy and recent literature is highlighting unreliability of HM 



estimators in phylogenetic models (Lartillot and Philippe, 2006) as well as in more 



general biological applications (Calderhead and Girolami, 2009). In this paper, we 



have provided evidence of improved effectiveness of a simple alternative marginal 
likelihood estimator, the Inflated Density Ratio estimator (IDR), belonging to the 
class of generalized harmonic mean estimators. It shares the original simplicity and 
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computation feasibility of the HM estimator but, unlike it, it enjoys important the- 
oretical properties, such as the finiteness of the variance. Moreover it allows one 
to recycle posterior simulations which is particularly appealing in those contexts - 
such as phylogenetic models - where the computational burden of the simulation is 
heavier than the evaluation of the likelihood, posterior densities and the like. Like 
all importance sampling techniques based on a single stream of simulations the com- 
putational burden can be shared in a parallel computing environment reducing the 
computing time. Also the grid search for optimizing the estimated RMSE can be 
speeded up with a parallel evaluation for each inflated density. 

We have verified the effectiveness of the IDR estimator in some of the most com- 
mon phylogenetic substitution models under different model complexity including 
mixed parameter space and evaluated the comparative performance with respect to 
HM and posterior AM estimators. In all circumstances the IDR estimator outper- 
formed the HM estimator in terms of precision and robustness of the estimates and 
it is then an interesting candidate to be included in standard software as a simple 
and more reliable model evidence output. Its simple implementation makes the IDR 
estimator a useful tool to be possibly used as a simple confirmation/benchmark even 
in those models where fine-tuned approximation tools such as thermodynamic in- 
tegration are available and, when appropriately fine-tuned, may yield more precise 
estimates. 
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Software 


Marginal likelihood estimation method 


BayesTraits 


Harmonic Mean 


BEAST 


Harmonic Mean or bootstrapped Harmonic Mean 


MrBayes 


Harmonic Mean 


PHASE 


Harmonic Mean and Reversible Jump 


PhyloBayes 


Thermodynamic Integration under normal approximation 



Table 1: Availability of marginal likelihood estimation methods in Bayesian phyloge- 
netics software: in spite of its inaccuracy, the harmonic mean estimator is still one 
of the most diffuse marginal likelihood estimation tool. 



28 



Distribution 


log c 


log CiDR 


RMSE £lDR 


k opt 


N(h,<t) 








10" 4 


10~ 4 










0.01 


IQ2J, 


SN(n, a, t) 





10~ 4 


0.004 


10~ 4 


SN 5 (n,a, r) 


3,467 


3.444 


0.014 


20 


SN 30 (n,cr,T) 


2,302 


2.403 


0.047 


10 4 


Mix N 2 


2.079 


2.078 


0.003 


0.01 


Mix N 3 


2.772 


2.766 


0.002 


1 


Mix N 10 





0.056 


0.012 


2 



Table 2: Performance of IDR approach with n = 10 5 i.i.d. draws simulated from 
distributions with known normalizing constants: univariate and multivariate gaus- 
sian distributions (up to 100 dimension), univariate and multivariate skew normal 
distributions (SN) (30 dimension) and multivariate mixtures of two Normal compo- 
nents (Mix N ) (in dimensions 2,3 and 10). The value logc represents the logarithm 
of the true value of the normalizing constant; log cjbr is the value of the estimated 
normalizing constant on a logarithmic scale. k opt is the optimal inflation coeffi- 
cient which minimizes the Relative Mean Square Error RM SE^ IDR computed as in 



(4-5). For the SN case a small sensitivity study (not reported here) showed that the 



performance of the method is robust to different (fi, a, r) parameter choices. 
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k 




lnP 1 CT 7-1 D 


RAISE- 


RAISE* - 


CI 




10- 


10 


-7264.438 


0.1710 


0.4515 


[-7264.852, - 


•7263.718] 


10" 


-9 


-7262.150 


0.1689 


0.4514 


[-7262.560, - 


-7261.443] 


10" 


-8 


-7259.939 


0.1602 


0.3664 


[-7260.332, - 


-7259.284] 


(*) 10" 


-7 


-7258.200 


0.1178 


0.3008 


[-7258.503, - 


-7257.764] 


10" 


-6 


-7257.554 


0.1407 


0.3694 


[-7257.907, - 


-7257.006] 



Table 3: Inflated Density Ratio method applied to Hadamard 1 data with a GTR+T 
model with a 18- dimensional parameter space. IDR estimates on the log scale 
for a small regular grid of perturbation values. The relative mean square er- 
rors RMSEc IDR are computed as in 
RMSE* c IDR are the relative mean square errors corrected for the autocorrelation. 
CI are confidence intervals on a log scale. Since the smallest error in the grid cor- 
responds to a perturbation value k opt = 10~ 7 , the IDR estimator for the GTR + V 
model is taken to be logcinn = —7258.200. 



4-5) without accounting for autocorrelation. 
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Method log(c) RMS~E & RMS~E &MC RMSE t>Boot RMSE* a 

IDR -7258.200 0.1178 0.1538 0.1698 0.3008 

HM -8365.509 173.2080 > 10 10 > 10 100 296.3475 

AM -7204.245 0.0197 0.0119 0.0202 0.065 



Table 4: Hadamard 1 data: marginal likelihood estimates obtained with the Inflated 
Density Ratio method cjbr, with the Harmonic Mean approach ch^i and with the 
Arithmetic Mean approach cam. The estimates are based on n = 10 6 from 10 7 simu- 
lated values. Three different RMSE estimates are provided: RMSE?. has been com- 
puted as in (4-5) for IDR, (6.3) for AM and (6.2) for HM; RMSE^mc comes from 
10 Monte Carlo independent replicates of the estimation; RMSEc,Boot is a boostrap 
approximation of RMSE (B=1000); RMSE*c is the estimated RMSE corrected for 
the autocorrelation as in \6.1 ). 
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GTR+r 



Method log(c) RMSE & RMSE 6>MC RMSE LBoot RMSE* d 

IDR -611.8571 0.1153 0.1087 0.1175 0.3608 

HM -594.648 31.5330 0.1329 0.3488 141.3285 

AM -588.286 0.0184 0.0863 0.0187 0.0826 

JC69 



Method log{c) RMSE? RMSE^ MC RMSE %Boot RMSE* d 

IDR -595.5919 0.0068 0.0161 0.0081 0.0179 

HM -589.0289 34.6759 0.1415 0.6787 59.5918 

AM -589.4194 0.0057 0.0146 0.0056 0.0182 



Table 5: Hadamard 2 data: marginal likelihood estimates of the GTR+T model 
(u = $t 1A ) obtained with the IDR method cjdr, with the HM approach chm and 
with the AM approach cani. The estimates are based on n = 10 6 from 10 7 simulated 
values. Three different RMSE estimates are provided: RMSEc has been computed 
as in (Q) for IDR, (fojj for AM and for HM; RM~SE LMC comes from 10 

Monte Carlo independent replicates of the estimation; RMSELBoot is a bootstrap 
approximation of RMSE (B=1000); RMSE*c is the RMSE corrected for the auto- 



correlation as in (6.1). 
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Method 


log(BF 'cTR+v-jcm ) 


. — -MC 
CI log(BF) 


IDR 


16.2652 


[16.1726,16.3578] 


HM 


5.6241 


[5.0206,5.1546] 


AM 


1.1334 


[1.0617,1.2051] 



Table 6: Hadamard 2 data: Bayes Factors computed with IDR, HM and AM ap- 
proach. The estimates are based on n = 10 from 10 simulated values. CI log 0^p^ 
is the confidence interval obtained as log(BF) ± 2 • S D mc(1°9(B F)) . log(BF) is 
obtained by averaging the estimated Bayes Factors (on logarithmic scale) of 10 x 10 
possible pairings of 10 MC replicates. SD MC {log(BF)) is computed as the stan- 
dard error of the estimated Bayes Factors (on logarithmic scale) in 10 x 10 possible 
combinations of 10 MC replicates. 
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T 


Size 


CI log{BF IDR ) 


- — - MC 

CI log(BF H M)) 


— MC 
CI log(BF AM ) 


log(BF 12 ) 


10 4 


[3.511,3.599] 


[2.37,4.066] 


[2.929,2.989] 


log(BF 13 ) 


10 4 


[3.817,3.901] 


[2.503,3.163] 


[2.053,3.131] 



Table 7: Hadamard 2 data: the Bayes Factor is computed in order to compare com- 
peting topologies The Bayes Factor is approximated with the IDR method (BF IDR ), 
the HM (BF HM ) and the AM (BF AM ) approach. CI log 0^p^ is the confidence interval 
obtained as log(BF) ± 2 • S D Mci}°g{.B F)) . log(BF) is obtained by averaging the 
estimated Bayes Factors (on logarithmic scale) of 10 x 10 possible combinations of 
10 MC replicates. S D MC (log(B F)) is computed as the standard error of the esti- 
mated Bayes Factors (on logarithmic scale) in 10 x 10 possible combinations of 10 
MC replicates. 
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